Edit Distance with Duplications and Contractions Revisited

Authors

  • Tamar Pinhas
  • Dekel Tsur
  • Shay Zakov
  • Michal Ziv-Ukelson
Abstract

In this paper, we propose three algorithms for the problem of string edit distance with duplication and contraction operations, which improve the time complexity of previous algorithms for this problem. These include a faster algorithm for the general case of the problem, and two improvements which apply under certain assumptions on the cost function. The general algorithm is based on fast min-plus multiplication of square matrices, and obtains a running time of $O\left(|\Sigma| \, \frac{n^3 \log^3 \log n}{\log^2 n}\right)$, where $n$ is the length of the input strings and $|\Sigma|$ is the alphabet size. This algorithm is further accelerated, under some assumption on the cost function, to $O\left(|\Sigma| \left(n^2 + \frac{n n'^2 \log^3 \log n'}{\log^2 n'}\right)\right)$ time, where $n'$ is the length of the run-length encoding of the input. Another improvement is based on a new fast matrix-vector min-plus multiplication under a certain discreteness assumption, and yields an $O\left(|\Sigma| \, \frac{n^3}{\log^2 n}\right)$ time algorithm. Furthermore, this algorithm is online, in the sense that one of the strings may be given letter by letter. As part of this algorithm we present the currently fastest online algorithm for weighted CFG parsing for discrete weighted grammars. This result is useful on its own.
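For readers unfamiliar with the two building blocks named in the abstract, the following is a minimal sketch of a min-plus matrix product and of run-length encoding. It is not the paper's algorithm: the product below is the naive cubic version, whereas the paper relies on faster min-plus multiplication. The function names `min_plus_product` and `run_length_encode` are illustrative assumptions, not taken from the paper.

```python
from itertools import groupby

INF = float("inf")


def min_plus_product(a, b):
    """Naive min-plus product: c[i][j] = min over k of a[i][k] + b[k][j].

    Illustrative cubic-time version only; the paper uses a faster
    min-plus multiplication that is not reproduced here.
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    c = [[INF] * cols for _ in range(rows)]
    for i in range(rows):
        for k in range(inner):
            aik = a[i][k]
            if aik == INF:
                continue  # an infinite entry can never improve a minimum
            for j in range(cols):
                candidate = aik + b[k][j]
                if candidate < c[i][j]:
                    c[i][j] = candidate
    return c


def run_length_encode(s):
    """Collapse each maximal run of equal letters into a (letter, count) pair."""
    return [(letter, len(list(run))) for letter, run in groupby(s)]


if __name__ == "__main__":
    a = [[0, 1], [INF, 0]]
    b = [[0, 5], [2, 0]]
    print(min_plus_product(a, b))

    encoded = run_length_encode("aaabccdd")
    # n is the string length, n' the number of runs in its encoding
    print(len("aaabccdd"), len(encoded), encoded)
```

In this sketch, a string of length $n = 8$ such as `aaabccdd` has a run-length encoding of length $n' = 4$, which is the quantity the second running-time bound depends on.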

Similar resources

Genomic Distances under Deletions and Insertions

As more and more genomes are sequenced, evolutionary biologists are becoming increasingly interested in evolution at the level of whole genomes, in scenarios in which the genome evolves through insertions, deletions, and movements of genes along its chromosomes. In the mathematical model pioneered by Sankoff and others, a unichromosomal genome is represented by a signed permutation of a multise...

Models and Algorithms for Comparative Genomics

The deluge of sequenced whole-genome data has motivated the study of comparative genomics, which provides global views on genome evolution, and also offers practical solutions in deciphering the functional roles of components of genomes. A fundamental computational problem in whole-genome comparison is to infer the most likely large-scale events (rearrangements and content-modifying events) of ...

Exemplar or Matching: Modeling DCJ Problems with Unequal Content Genome Data

The edit distance under the DCJ model can be computed in linear time for genomes with equal content or with Indels. But it becomes NP-Hard in the presence of duplications, a problem largely unsolved especially when Indels are considered. In this paper, we compare two mainstream methods to deal with duplications and associate them with Indels: one by deletion, namely DCJ-Indel-Exemplar distance;...

The Normalized String Editing Problem Revisited

Marzal and Vidal [8] recently considered the problem of computing the normalized edit distance between two strings, and reported experimental results which demonstrated the use of the measure to recognize handwritten characters. Their paper formulated the theoretical properties of the measure and developed two algorithms to compute it. In this short communication we shall demonstrate how this m...

Adaptive Approximate Record Matching

Typographical data entry errors and incomplete documents produce imperfect records in real-world databases. These errors generate distinct records which belong to the same entity. The aim of Approximate Record Matching is to find multiple records which belong to an entity. In this paper, an algorithm for Approximate Record Matching is proposed that can be adapted automatically with input error...

Publication date: 2011